Method and system of personalized blending for content recommendation

ABSTRACT

The present teaching relates to personalized content recommendation. A webpage is constructed for a user, the webpage having a plurality of slots, each of which is to be allocated with a content item. For each of the plurality of slots, a plurality of content items in a plurality of types of content are accessed. For each of the plurality of types of content, a personalized score is predicted for each content item in the type of content, wherein the personalized score is obtained based on a trained model. A recommended content item of the type of content is selected based on the personalized scores. An overall recommended content item is selected and allocated to the slot based on criteria associated with the personalized scores of the recommended content items and a business rule. The webpage with the plurality of slots allocated with content items is provided to the user.

CROSS REFERENCE TO RELATED APPLICATION

The present application is a continuation of U.S. patent application Ser. No. 16/712,278, filed on Dec. 12, 2019, entitled “METHOD AND SYSTEM OF PERSONALIZED BLENDING FOR CONTENT RECOMMENDATION”, the contents of which are hereby incorporated by reference in their entirety.

BACKGROUND

1. Technical Field

The present teaching generally relates to methods, systems, and programming for content personalization. Particularly, the present teaching is directed to methods, systems and programming for blending content items from different content corpora.

2. Technical Background

The Internet has made it possible for a person to electronically access virtually any content at any time and from any location. Internet technology facilitates information publishing, information sharing, and data exchange in various spaces and among different persons. Typically, users issue a search query to a search engine to obtain desirable content. A search engine is one type of information retrieval system that is designed to help users search for and obtain access to information that is stored in a computer system or across a network of computers. In response to a query from a user, the search engine can search different content providers online to obtain search results matching the query. Content providers can be a publisher, a content portal, or any other source from which content can be obtained.

The ability to deliver personalized content is crucial to content platforms such as Yahoo! News, Google Finance, Facebook, etc., which provide rich content of different types and different topics. For example, the types of content provided could be articles, videos, etc., wherein each type of content is obtained from a specific corpus (i.e., database). Typically, when content of different types is recommended, a dedicated model associated with each corpus is used to recommend content from that corpus. For organizations that want their web sites to perform optimally, understanding personalized content blending, i.e., how the web site combines all of the different types of recommendations, is critical. Personalized blending for content recommendation faces the problem of merging heterogeneous content recommended from specialized corpora into a single result set that serves the user's information need and maximizes the user's satisfaction. The quality of a blending operation is measured by a user satisfaction metric, i.e., a better blending results in a higher level of user satisfaction.

It must be appreciated that the problem of blending content items from different content corpora is different from traditional recommendation of homogeneous content. In the traditional content recommendation scenario, content items from the same corpus (i.e., the content items that are to be recommended to the user) share the same properties. In contrast, in personalized content blending, even though each content corpus has a more or less structured representation, it is very unlikely that a content corpus shares common features with other content corpora, due to the heterogeneous nature of different types of content. Accordingly, comparing the relevance of content items from different corpora and blending them correctly (i.e., in a manner that increases the user's satisfaction level) becomes particularly challenging. As such, the mechanism used in homogeneous content recommendation cannot be applied directly to the problem of blending different types of content items.

Traditionally, to solve the above-described problem of personalized content blending, one of two approaches is implemented: a fixed ratio approach or a calibration approach. However, neither approach is data driven or personalized. In the fixed ratio blending approach, a predetermined ratio of content items of a first type (e.g., videos) to content items of another type (e.g., images) is used to provide the user with the different types of content items. Such a fixed ratio approach is not a desirable solution as it does not address personalization. Specifically, some users may want to see more videos, while other users may want to see fewer videos. The calibration approach typically assumes the availability of a ground truth. Specifically, it is assumed that a ground truth is available against which recommendation scores from different corpora can be calibrated, so that, for example, a score of 0.8 from a video recommendation model becomes comparable with a score of 0.75 from an article recommendation model. However, in practice, it is difficult to find a ground truth against which content items from different corpora can be calibrated. Moreover, such a calibration approach also suffers from poor scalability. Accordingly, there is a need to devise a solution that addresses the above-stated problems.

SUMMARY

The teachings disclosed herein relate to methods, systems, and programming for content personalization. Particularly, the present teaching is directed to methods, systems and programming for blending content items from different content corpora.

One aspect of the present disclosure provides for a method, implemented on a machine having at least one processor, storage, and a communication platform capable of connecting to a network for personalized content recommendation. The method comprises the steps of: constructing a webpage for a user having a plurality of slots each of which is to be allocated with a content item; for each of the plurality of slots, accessing a plurality of content items in a plurality of types of content, for each of the plurality of types of content, predicting a personalized score for each content item in the type of content, wherein the personalized score represents an estimated level of satisfaction when the content item is recommended to the user and is obtained based on a model trained using training data associated with the user, and selecting a recommended content item of the type of content based on personalized scores of the content items of the type, selecting an overall recommended content item from the recommended content items of the plurality of types of content based on criteria associated with the personalized scores of the recommended content items and a business rule, and allocating the overall recommended content item to the slot; and providing the webpage with the plurality of slots allocated with content items to the user.

By one aspect of the present disclosure, there is provided a system for content personalization. The system comprises at least one processor configured to: construct a webpage for a user having a plurality of slots each of which is to be allocated with a content item; for each of the plurality of slots, access a plurality of content items in a plurality of types of content, for each of the plurality of types of content, predict a personalized score for each content item in the type of content, wherein the personalized score represents an estimated level of satisfaction when the content item is recommended to the user and is obtained based on a model trained using training data associated with the user, and select a recommended content item of the type of content based on personalized scores of the content items of the type, select an overall recommended content item from the recommended content items of the plurality of types of content based on criteria associated with the personalized scores of the recommended content items and a business rule, and allocate the overall recommended content item to the slot; and provide the webpage with the plurality of slots allocated with content items to the user.

Other concepts relate to software for implementing the present teaching. A software product, in accord with this concept, includes at least one machine-readable non-transitory medium and information carried by the medium. The information carried by the medium may be executable program code data, parameters in association with the executable program code, and/or information related to a user, a request, content, or other additional information.

In one example, there is provided a non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a computer, cause the computer to perform a method for personalized content recommendation. The method comprises the steps of: constructing a webpage for a user having a plurality of slots each of which is to be allocated with a content item; for each of the plurality of slots, accessing a plurality of content items in a plurality of types of content, for each of the plurality of types of content, predicting a personalized score for each content item in the type of content, wherein the personalized score represents an estimated level of satisfaction when the content item is recommended to the user and is obtained based on a model trained using training data associated with the user, and selecting a recommended content item of the type of content based on personalized scores of the content items of the type, selecting an overall recommended content item from the recommended content items of the plurality of types of content based on criteria associated with the personalized scores of the recommended content items and a business rule, and allocating the overall recommended content item to the slot; and providing the webpage with the plurality of slots allocated with content items to the user.

Additional advantages and novel features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The advantages of the present teachings may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.

BRIEF DESCRIPTION OF THE DRAWINGS

The methods, systems and/or programming described herein are further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:

FIG. 1 illustrates an exemplary system configuration in which a content blending engine can be deployed, according to an embodiment of the present teaching;

FIG. 2 illustrates another exemplary system configuration in which a content blending engine can be deployed, according to an embodiment of the present teaching;

FIG. 3 depicts an exemplary system diagram of a content blending engine, according to various embodiments of the present teaching;

FIG. 4 is a flowchart of an exemplary process performed by a content blending engine, according to various embodiments of the present teaching;

FIG. 5 depicts an exemplary system diagram of a training engine for training a reward predictor, according to various embodiments of the present teaching;

FIG. 6 is a flowchart of an exemplary process performed by a training engine to train a reward predictor, according to various embodiments of the present teaching;

FIG. 7 depicts an exemplary system diagram of a training data generator, according to various embodiments of the present teaching;

FIG. 8 is a flowchart of an exemplary process performed by a training data generator, according to various embodiments of the present teaching;

FIG. 9A depicts exemplary contextual information according to various embodiments of the present teaching;

FIG. 9B depicts exemplary content types according to various embodiments of the present teaching;

FIG. 10 depicts an architecture of a mobile device which can be used to implement a specialized system incorporating the present teaching; and

FIG. 11 depicts the architecture of a computer which can be used to implement a specialized system incorporating the present teaching.

DETAILED DESCRIPTION

In the following detailed description, numerous specific details are set forth by way of examples in order to provide a thorough understanding of the relevant teachings. However, it should be apparent to those skilled in the art that the present teachings may be practiced without such details. In other instances, well known methods, procedures, components, and/or circuitry have been described at a relatively high level, without detail, in order to avoid unnecessarily obscuring aspects of the present teachings.

Subject matter will now be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, specific example embodiments. Subject matter may, however, be embodied in a variety of different forms and, therefore, covered or claimed subject matter is intended to be construed as not being limited to any example embodiments set forth herein. Example embodiments are provided merely to be illustrative. Likewise, a reasonably broad scope for claimed or covered subject matter is intended. Among other things, for example, subject matter may be embodied as methods, devices, components, or systems. Accordingly, embodiments may, for example, take the form of hardware, software, firmware or any combination thereof (other than software per se). The following detailed description is, therefore, not intended to be taken in a limiting sense.

Throughout the specification and claims, terms may have nuanced meanings suggested or implied in context beyond an explicitly stated meaning. Likewise, the phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment and the phrase “in another embodiment” as used herein does not necessarily refer to a different embodiment. It is intended, for example, that claimed subject matter include combinations of example embodiments in whole or in part.

In general, terminology may be understood at least in part from usage in context. For example, terms, such as “and”, “or”, or “and/or,” as used herein may include a variety of meanings that may depend at least in part upon the context in which such terms are used. Typically, “or” if used to associate a list, such as A, B or C, is intended to mean A, B, and C, here used in the inclusive sense, as well as A, B or C, here used in the exclusive sense. In addition, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.

The problem of personalized content blending can be broadly defined as determining an optimal manner of aggregating content items of different types to be provided to a user. Described herein are a method, a system, and a computer program product that address the problem of personalized blending for content recommendation. Because heterogeneous results are combined with one another, it is important to present them in a proper manner so as to maximize user satisfaction. Despite a few recent advances, several challenges remain in addressing this problem: (a) Heterogeneous nature of different corpora: the manner in which each corpus recommends its content items may be unknown to the other corpora. Furthermore, the recommendation scores returned from different corpora are not comparable to one another. Moreover, the scale of the corpora and the increasing number of content properties require a solution that is efficient and scalable; (b) User satisfaction definition: it is rather unclear how to determine a scoring function that maps the blended content presentation to user satisfaction for each particular user. Determining the right user satisfaction metric based on user responses is not a trivial task. Typically, the user satisfaction metric is heuristically defined as an aggregation of fine-grained user responses, such as click-through rates and long-dwell-time clicks; and (c) Training data collection: model-based approaches have become popular in recent years, and machine learning techniques have been used to fit the models. A new challenge posed by such approaches is how to gather enough labeled data to train the model. Hiring editors to judge the results can be expensive and is not scalable. Getting judgements is difficult, and the judgements can be biased if personalization needs to be considered. Different from traditional recommendation problems, wherein content items from a single corpus are presented to users, in the content blending problem, if content items of different types have not been blended and shown to users, the recommendation system receives no feedback, and thus has no knowledge as to whether a given type of content matches the user's interest or is preferred over other types of content.

Accordingly, by one embodiment of the present disclosure, a guided exploration technique is provided in order to reduce the impact of exploration on user experience and to make efficient use of the limited amount of traffic available for collecting user feedback. Further, to address the challenges of aggregating content recommended from heterogeneous corpora, the present disclosure provides a personalized content blending approach in a unified and principled manner. Specifically, by one embodiment, the problem of personalized content blending is formulated as a multi-armed contextual bandit (MA-CB) problem. The MA-CB problem models a situation where, in a sequence of independent trials, a model chooses, based on given contextual information, an action from a set of possible actions so as to maximize the total reward of the chosen actions. The reward depends on both the action chosen and the context.

As will be described later, each type of corpus is modeled as a bandit, and reward is used as a proxy for user satisfaction. It must be appreciated that a higher reward corresponds to a higher level of user satisfaction. Rewards depend on how engaged users are with the recommendation system. Examples of user engagement include click-skip based or dwell-time based user interactions. Moreover, an action (with regard to the MA-CB problem) corresponds to selecting a content item from one of the different types of content corpora and allocating the selected content item at a position (i.e., a slot) on a webpage.
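The two engagement signals named above can be turned into scalar rewards in several ways. The sketch below assumes a binary reward per shown item; the `Interaction` record and the 30-second long-dwell threshold are hypothetical choices for illustration, not values given by the present teaching.

```python
from dataclasses import dataclass

# Hypothetical threshold for a "long dwell" click; not specified in the source.
LONG_DWELL_SECONDS = 30.0


@dataclass
class Interaction:
    """Hypothetical record of how a user responded to one shown content item."""
    clicked: bool
    dwell_seconds: float = 0.0


def click_skip_reward(event: Interaction) -> float:
    """Click-skip reward: 1.0 if the item was clicked, 0.0 if it was skipped."""
    return 1.0 if event.clicked else 0.0


def dwell_time_reward(event: Interaction,
                      threshold: float = LONG_DWELL_SECONDS) -> float:
    """Dwell-time reward: 1.0 only for a click whose dwell time reaches the threshold."""
    return 1.0 if event.clicked and event.dwell_seconds >= threshold else 0.0
```

A dwell-time based reward is stricter than a click-skip reward: a quick click followed by a bounce earns 1.0 under the first function but 0.0 under the second.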

According to one embodiment, the recommendation system learns a reward predictor from training data collected with the guided exploration strategy. It must be appreciated that the reward predictor is a scoring function that maps content items to a user satisfaction metric. Thereafter, given a user, the reward predictor can estimate the expected reward for each content item (of different types or different topics) and organize the blending result in a way that maximizes the total reward. Specifically, as will be described later, by one embodiment, a greedy slotting technique is employed to allocate content items in each slot of a webpage. Thus, in contrast to approaches that rely on human judgements, the embodiments described in the present disclosure leverage implicit user feedback to learn the model. Such a method is efficient to implement and can be applied to corpora of different natures.

Turning now to FIG. 1, there is illustrated an exemplary system configuration in which a content blending engine 140 can be deployed in accordance with various embodiments of the present teaching. In the configuration 100 depicted in FIG. 1, the content blending engine 140 is directly connected to a network 120 and operates as an independent service on the network. In FIG. 1, the exemplary system 100 includes users 110, a network 120, a content server 130, the content blending engine 140, a database 150, and content corpora 160, including content corpus 1 160-a, content corpus 2 160-b, . . . , and content corpus N 160-c.

The network 120 may be a single network or a combination of different networks. For example, the network may be a local area network (LAN), a wide area network (WAN), a public network, a private network, a proprietary network, a Public Switched Telephone Network (PSTN), the Internet, a wireless network, a cellular network, a virtual network, or any combination thereof. A network may also include various network access points, e.g., wired or wireless access points such as base stations or Internet exchange points 120-a, . . . , 120-b, through which a data source may connect to the network 120 in order to transmit information via the network and a network node may connect to the network 120 in order to receive information. In one embodiment, the network 120 may be a content distribution network that connects users 110 to the content server 130 and the content blending engine 140, which provide the users with a mix (i.e., blend) of relevant content items obtained from the content corpora 160.

Users 110 may be of different types, such as users connected to the network via desktop connections (110-4), or users connecting to the network via wireless connections such as through a laptop (110-3), a handheld device (110-1), or a built-in device in a motor vehicle (110-2). A user may send a query to the content server 130 or the content blending engine 140 via the network 120 and receive (as a response) a corresponding search result (e.g., a webpage including blended content items) through the network 120. By one embodiment, a user's query may be directed to the content server 130. Alternatively, in some embodiments, the query may be directed to the content blending engine 140. Accordingly, the user's query may be handled by either the content server 130 or the content blending engine 140, both of which may search the content corpora 160 for content items that are to be blended, i.e., inserted on the webpage to be rendered to the user.

Content corpora 160 may correspond to an entity, whether an individual, a firm, or an organization, publishing or supplying content, including a blogger, a television station, a newspaper issuer, a web page host, a content portal, an online service provider, or a game server. For example, in connection with an online or mobile ad network, a content source 160 may be an organization such as CNN.com, a content portal such as YouTube and Yahoo.com, or a content-soliciting/feeding source such as Twitter or blogs. By one embodiment of the present disclosure, each content corpus stores a specific type of content items. For example, content corpus 1 160-a may store content items of type images, content corpus 2 160-b may store content items of type videos, content corpus N 160-c may store content items of type articles/documents, etc.

As stated previously, the embodiment illustrated in FIG. 1 includes the content blending engine 140 operating as an independent service (i.e., a standalone service) on the network 120. In FIG. 2, an alternate configuration 200 is provided, wherein the content blending engine 140 operates as a special module in the backend of the content server 130. When there are multiple content servers (not shown), each may have its own backend module for content processing/blending purposes. Nonetheless, it must be appreciated that the content blending engine 140 as shown in FIG. 2 performs functions similar to those described above with reference to FIG. 1. In what follows, there is provided a detailed description of the processing performed by the content blending engine 140.

FIG. 3 depicts an exemplary system diagram of a content blending engine 140, according to various embodiments of the present teaching. The content blending engine 140 comprises content corpora 310, including content corpus 1 310-a, content corpus 2 310-b, . . . , and content corpus K 310-c. The different types of content items included in the content corpora 310 may include content types such as documents/articles, images, videos, slideshows, etc., as shown in FIG. 9B. The content blending engine 140 further includes a content retrieving unit 320, a reward estimating unit 330, a content selecting unit 340, a validation unit 350, a content rendering unit 360, and a trained scoring function 335.

By one embodiment, the content blending engine 140 blends different types of content items to be provided to a user through a data-driven machine learning process. The content blending engine 140 collects training data records using a guided exploration strategy and learns a reward predictor, i.e., a model, which is a scoring function that maps blended content items to a user satisfaction metric. Given a user, the reward predictor can estimate the expected reward for each content item (of different types or different topics) and optimize the total reward of a blending in order to maximize user satisfaction.

As described below, upon obtaining a trained/learned scoring function (i.e., the reward predictor), the content blending engine 140 utilizes a greedy slotting mechanism to determine the final slotting of content items on a webpage. For each slot on a webpage that is to be allocated with a content item, the content retrieving unit 320 accesses a plurality of content items in a plurality of types of content from the content corpora 310. For each of the plurality of types of content (i.e., content from corpus 310-a, 310-b, 310-c, etc.), the reward estimating unit 330 predicts a reward (i.e., a personalized score) for each content item in the type of content. Note that the reward, or personalized score, represents an estimated level of satisfaction when the content item is recommended to the user. The reward is predicted based on the trained scoring function 335, i.e., a model trained using training data associated with the user. Details pertaining to the training of the reward predictor are described below with reference to FIG. 5.
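The present teaching does not fix the form of the trained scoring function 335. As one possible sketch, assume a logistic model over a single feature vector that concatenates user/context features, item features, and a one-hot code of the content type, so that one shared model scores items from every corpus on the same reward scale. The names `featurize`, `predict_reward`, and `CONTENT_TYPES` are hypothetical, and the sketch assumes item features have the same length across types.

```python
import math
from typing import List, Sequence

# Hypothetical content-type vocabulary (cf. FIG. 9B); the encoding is an assumption.
CONTENT_TYPES = ["article", "image", "video"]


def featurize(user_features: Sequence[float],
              item_features: Sequence[float],
              content_type: str) -> List[float]:
    """Concatenate user/context features, item features and a one-hot type code.

    The one-hot code lets a single shared model score items of every type.
    """
    one_hot = [1.0 if t == content_type else 0.0 for t in CONTENT_TYPES]
    return list(user_features) + list(item_features) + one_hot


def predict_reward(weights: Sequence[float], bias: float,
                   features: Sequence[float]) -> float:
    """Logistic scoring function: feature vector -> estimated reward in (0, 1)."""
    if len(weights) != len(features):
        raise ValueError("weight and feature dimensions differ")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

With all-zero weights and bias the predicted reward is 0.5 for every item; training (FIG. 5) would move the weights so that, for a given user context, items the user is more likely to engage with receive higher scores.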

By one embodiment, the content selecting unit 340 is configured to select a recommended content item of each type of content based on the personalized scores of the content items of that type. Specifically, the selecting unit 340 is configured to select a content item from each of the corpora 310-a, 310-b, and 310-c. The content item selected from each corpus (i.e., the recommended content item from that corpus) is the content item having the highest reward (predicted score) among the content items of the corpus.
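The per-type selection described above is an argmax within each corpus. A minimal sketch, assuming the predicted rewards have already been computed and are held as hypothetical `(item_id, predicted_reward)` pairs keyed by content type:

```python
from typing import Dict, List, Tuple

# Each content type maps to its candidate (item_id, predicted_reward) pairs.
ScoredItems = List[Tuple[str, float]]


def recommend_per_type(scored: Dict[str, ScoredItems]) -> Dict[str, Tuple[str, float]]:
    """Return, for every content type, the item with the highest predicted reward.

    Types with no candidates are omitted; ties keep the first item listed.
    """
    return {
        content_type: max(items, key=lambda pair: pair[1])
        for content_type, items in scored.items()
        if items
    }
```

For example, with articles scored 0.2 and 0.7 and a single video scored 0.4, the article scored 0.7 and the video are the recommended content items of their respective types.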

Further, the selecting unit 340 is configured to select, from the recommended content items, an overall recommended content item to be allocated to the slot under consideration. By one embodiment, the selecting unit 340 selects as the overall recommended content item the recommended content item having the highest predicted score. The selected overall recommended content item is validated by the validation unit 350 in accordance with rules 355. In response to a successful validation, the selected overall recommended content item is allocated to the slot. However, in response to an unsuccessful validation, the selected overall recommended content item is discarded, and the validation unit 350 instructs the selecting unit 340 to select an alternate overall recommended content item.
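The text does not say how the alternate overall recommended content item is chosen. One natural reading, assumed in the sketch below, is to fall back to the next-highest-scoring recommended item. The validator is passed in as a generic, hypothetical `is_valid` callable standing in for the validation unit 350 and rules 355.

```python
from typing import Callable, Dict, Optional, Tuple

Candidate = Tuple[str, str, float]  # (content_type, item_id, predicted_reward)


def select_overall(
    recommended: Dict[str, Tuple[str, float]],
    is_valid: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Try per-type recommendations from highest to lowest predicted reward.

    Returns the first candidate that passes validation, or None if every
    candidate is rejected. Falling back to the next-best candidate is an
    assumption; the source only says an alternate item is selected.
    """
    candidates = sorted(
        ((ctype, item_id, score) for ctype, (item_id, score) in recommended.items()),
        key=lambda c: c[2],
        reverse=True,
    )
    for candidate in candidates:
        if is_valid(candidate):
            return candidate
        # Rejected: discard it and move on to the next-best candidate.
    return None
```

Returning `None` when every candidate fails validation keeps the loop finite. How such a slot should be handled in practice is left open by the description.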

The rules 355 used by the validation unit 350 to validate a content item may correspond to business rules such as prohibiting a successive display (i.e., in successive slots) of similar types of content items, prohibiting display of a certain type of content item at specific slots on the webpage, etc. It must be appreciated that the rules 355 may correspond to business rules and/or to rules specified by the user, for example, based on the user's preferences. The validated content item is allocated to the slot under consideration. The process described above is then performed for the next slot on the webpage and is repeated until a stopping condition is satisfied, e.g., each slot on the webpage has been allocated a content item. In response to the stopping condition being satisfied, the content rendering unit 360 constructs a webpage with all the slots allocated with content items and provides the webpage to the user. In this manner, the content blending engine 140 scores heterogeneous items with a single reward predictor trained jointly across the different types of content items, which allows the relevance of the content items to be compared and the items to be blended in a principled manner.
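The two example business rules above can be expressed as simple predicates over the slots filled so far. The sketch assumes that "similar types" means the same content type, that slots are filled in order (so the slot under consideration is `len(allocated_types)`), and that slot restrictions are given as a hypothetical mapping from slot index to banned types.

```python
from typing import Dict, List, Set


def passes_rules(content_type: str,
                 allocated_types: List[str],
                 banned: Dict[int, Set[str]]) -> bool:
    """Validate a candidate's content type against two example business rules.

    allocated_types holds the types already placed in slots 0, 1, ...; since
    slots are filled in order, the slot being filled is len(allocated_types).
    """
    slot_index = len(allocated_types)
    # Rule 1: no two successive slots may hold the same content type.
    if allocated_types and allocated_types[-1] == content_type:
        return False
    # Rule 2: some content types are prohibited at specific slots.
    if content_type in banned.get(slot_index, set()):
        return False
    return True
```

User-specified rules based on the user's preferences could be added as further checks in the same predicate.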

FIG. 4 is a flowchart of an exemplary process performed by a content blending engine 140 according to various embodiments of the present teaching. The process commences in step 410, wherein different types of content items are retrieved from the respective content corpora, i.e., a plurality of content items in a plurality of types of content are accessed.

In step 415, for each of the plurality of types of content, a reward is predicted for each content item in the type of content. The reward is predicted based on a trained scoring function. In step 420, a content item is selected based on the predicted rewards. Specifically, as described above, an overall recommended content item is selected from the recommended content items of the plurality of types of content. The process then moves to step 425, wherein the overall recommended content item is validated in accordance with one or more rules.

In step 430, a query is performed to determine whether the selected overall recommended content item is valid. If the response to the query is affirmative, the process moves to step 435; else, if the response to the query is negative, the process loops back to step 420. In step 435, the validated overall recommended content item is assigned to the slot (on a webpage) under consideration. Thereafter, in step 440, a further query is performed to determine whether all slots have been allocated with a content item. If the response to the query is affirmative, the process moves to step 445, wherein a webpage including the content items allocated to the respective slots is rendered and provided to the user. However, if the response to the query in step 440 is negative, the process loops back to step 420 to determine an overall recommended content item to be allocated to the next slot on the webpage.
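Steps 410 through 445 can be sketched end to end as a greedy slotting loop. This is an illustration under stated assumptions, not the disclosed implementation: the trained scoring function is passed in as a hypothetical `score(content_type, item_id)` callable, validation as a hypothetical `is_valid(content_type, page)` callable, an item is assumed to be shown at most once per page, and a slot is left empty (`None`) if no candidate passes validation, which keeps the step 430 to step 420 loop finite.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Item = Tuple[str, str]  # (content_type, item_id)


def greedy_slotting(
    num_slots: int,
    corpora: Dict[str, List[str]],
    score: Callable[[str, str], float],
    is_valid: Callable[[str, List[Optional[Item]]], bool],
) -> List[Optional[Item]]:
    """Fill slots one at a time with the best-scoring item that passes validation."""
    page: List[Optional[Item]] = []
    used: Set[Item] = set()  # assumption: an item is shown at most once per page
    for _ in range(num_slots):
        # Steps 410-415: access each corpus and score its not-yet-used items;
        # keep the top item per type (the per-type recommended content item).
        recommended = []
        for ctype, items in corpora.items():
            remaining = [i for i in items if (ctype, i) not in used]
            if remaining:
                best = max(remaining, key=lambda i: score(ctype, i))
                recommended.append((score(ctype, best), ctype, best))
        # Steps 420-430: try recommendations from highest to lowest score
        # until one passes validation against the slots filled so far.
        chosen: Optional[Item] = None
        for _, ctype, item in sorted(recommended, reverse=True):
            if is_valid(ctype, page):
                chosen = (ctype, item)
                break
        # Step 435: allocate the slot (None marks a slot that stays empty).
        page.append(chosen)
        if chosen is not None:
            used.add(chosen)
    # Step 440 is the loop condition; step 445 (rendering) is left to the caller.
    return page
```

With a no-repeat rule, a corpus whose items all outscore the others still cannot fill two consecutive slots. In that case the loop falls back to the best item of another type, which is how a business rule shapes the blend even when scores alone would not.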

Turning now to FIG. 5, there is depicted an exemplary system diagram of a training engine for training a reward predictor, according to various embodiments of the present teaching. As stated previously, according to one embodiment of the present disclosure, the problem of personalized content blending is formulated as a multi-armed contextual bandit (MA-CB) problem. In the MA-CB problem formulation, each data corpus is modeled as a bandit, and an action corresponds to selecting a content item from one of the data corpora. In what follows, there is described a mechanism for training a reward predictor under the MA-CB setting.

As shown in FIG. 5, the training engine includes a learner 515, an environment 530, and a plurality of actions 510 including action 1 510-a, action 2 510-b, ..., action K 510-c. Note that each action corresponds to a data corpus. The learner unit 515 includes an action taking unit 520, a reward learning unit 525, an update unit 535, and a policy 540. Over a predetermined number of training iterations, the interactions between the learner unit 515 and the environment 530 result in a trained reward predictor, i.e., scoring function 545.

By one embodiment of the present disclosure, the MA-CB problem can be formalized by defining a set of actions A (i.e., Action 1, Action 2, ..., Action K), a contextual vector for each iteration of the training, a reward for each iteration, and a policy. It must be appreciated that the contextual vector corresponds to environmental information. Exemplary information included in the contextual vector can include time, platform, location, user information, etc., as shown in FIG. 9A. The reward is a proxy for a level of user satisfaction, and the policy is a mapping from contextual information to an action.
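One common way to write this setting down is shown below. The notation is ours rather than the disclosure's: x_t is the contextual vector revealed at iteration t, the action set has K elements (one per data corpus), r is the reward, pi is the policy, and f denotes the scoring function whose greedy choice gives the exploitation action.

```latex
\begin{aligned}
&\text{for } t = 1, \dots, T: \\
&\qquad x_t \in \mathcal{X} \quad \text{(context revealed by the environment)} \\
&\qquad a_t = \pi(x_t) \in \mathcal{A} = \{1, \dots, K\} \\
&\qquad r_t = r(x_t, a_t) \quad \text{(observed for } a_t \text{ only)} \\[4pt]
&\text{goal: } \max_{\pi} \; \mathbb{E}\Big[\textstyle\sum_{t=1}^{T} r(x_t, \pi(x_t))\Big],
\qquad f(x, a) \approx \mathbb{E}[\, r \mid x, a \,],
\qquad \pi_{\text{greedy}}(x) = \arg\max_{a \in \mathcal{A}} f(x, a)
\end{aligned}
```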

The interactions between the learner and the environment can be described as follows. In each iteration of the training, contextual information is revealed to the learner 515, i.e., an observation is provided by the environment 530 to the learner 515. Upon receiving an observation, the action taking unit 520 of the learner 515 utilizes a policy 540 to select an action from the set of actions 510. It must be noted that when the learner 515 is provided an observation, it has no knowledge of the rewards associated with the actions. Only upon selecting a particular action does the reward learning unit 525 (of the learner 515) receive reward information, pertaining only to the selected action, from the environment 530. Because the learner 515 does not receive any feedback for the unchosen actions, it has to gather information about them by choosing them once in a while. This kind of strategy is referred to as exploration, in contrast to the strategy of exploitation, which pertains to selecting the action with the best estimated reward based on the information gathered so far. Exploration could harm the received reward in the short term, as a suboptimal action may be chosen. On the other hand, obtaining information about the actions through exploration can help the learner get a better estimate of the actions' rewards, and in turn increase the long-term overall reward.

The action taking unit 520 of the learner 515 selects an action from the set of actions 510 based on the policy 540. The policy 540 is the behavior of the learner and affects the balance between exploitation and exploration. For instance, the learner 515 may start with random exploration. After several iterations of random exploration, the learner will have a good estimate of the reward for each action. The learner 515 may then either exploit (i.e., greedily select the action with the best estimated reward) or randomly try another action. The update unit 535 updates the policy based on the data (i.e., action and associated reward) learned at each iteration. By one embodiment of the present disclosure, in order to efficiently balance the exploration and exploitation strategies, a guided exploration strategy is provided. Specifically, training data 550 generated based on the user's previous interactions is provided to the update unit 535 to help the learner 515 update the policy 540. As will be described next with reference to FIG. 7, weighted training data (associated with the user) is generated to provide a guided exploration mechanism to the learner 515 in order to understand the user's preference. In this manner, the scoring function 545, i.e., the reward predictor, is trained using data collected at each iteration of training, i.e., via exploration and/or exploitation strategies.
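The behavior described above (random exploration first, then either greedy exploitation or an occasional random action) resembles an explore-first, epsilon-greedy policy. The sketch below assumes that simplification: it ignores the contextual vector, keeps a running mean reward per action, and updates only the chosen action's estimate because only that reward is revealed. The class name and the `warmup` and `epsilon` parameters are hypothetical.

```python
import random
from typing import List


class ExploreThenEpsilonGreedy:
    """Hypothetical context-free policy: uniform exploration for the first `warmup`
    rounds, then epsilon-greedy over running-mean reward estimates."""

    def __init__(self, num_actions: int, warmup: int = 50, epsilon: float = 0.1, seed: int = 0) -> None:
        self.counts: List[int] = [0] * num_actions
        self.means: List[float] = [0.0] * num_actions
        self.warmup = warmup
        self.epsilon = epsilon
        self.t = 0
        self.rng = random.Random(seed)

    def select(self) -> int:
        self.t += 1
        if self.t <= self.warmup or self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore: any action
        return max(range(len(self.means)), key=self.means.__getitem__)  # exploit

    def update(self, action: int, reward: float) -> None:
        # Bandit feedback: only the chosen action's estimate changes.
        self.counts[action] += 1
        self.means[action] += (reward - self.means[action]) / self.counts[action]
```

The warm-up phase plays the role of the initial random exploration; afterwards, `epsilon` controls how often the learner still tries an action other than the current best.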

FIG. 6 is a flowchart of an exemplary process performed by a training engine to train a reward predictor, according to various embodiments of the present teaching. The process commences in step 610, wherein weighted training data is obtained. Details regarding the generation of the weighted training data are described later with reference to FIG. 7. In step 615, a contextual feature vector is generated. The generated contextual feature vector is provided to the learner unit of the training engine. The process then moves to step 620, wherein an action from a plurality of actions is chosen in accordance with a policy. In step 625, upon choosing an action, a reward associated with the action is learnt.

In step 630, the scoring function, i.e., the reward predictor model, is updated based on the chosen action and the learnt reward. Further, in step 635, the policy is updated based on at least one of the chosen action, the learnt reward, and the weighted training data. In step 640, a query is performed to determine whether training is complete, e.g., whether a predetermined number of training iterations have been completed. If the response to the query is affirmative, the process moves to step 645, wherein a trained scoring function is obtained. However, if the response to the query of step 640 is negative, the process loops back to step 615 to execute another training iteration.
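A compact sketch of the FIG. 6 loop follows, under several assumptions that are not taken from the disclosure: the scoring function is a per-action linear model trained by stochastic gradient descent, the policy is epsilon-greedy over that model, the weighted training data are folded in once as a warm start, and weights are clipped at a hypothetical `max_weight` so a very small p cannot destabilize the updates. All names are illustrative.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical weighted record: (action a, context x, reward r, weight w = 1/p).
WeightedRecord = Tuple[int, Vector, float, float]


class LinearScorer:
    """Hypothetical scoring function f(x, a) = w_a . x, one weight vector per action."""

    def __init__(self, num_actions: int, dim: int, lr: float = 0.1) -> None:
        self.w: List[List[float]] = [[0.0] * dim for _ in range(num_actions)]
        self.lr = lr

    def predict(self, x: Vector, a: int) -> float:
        return sum(wi * xi for wi, xi in zip(self.w[a], x))

    def update(self, x: Vector, a: int, reward: float, weight: float = 1.0) -> None:
        # One weighted SGD step on the squared error (reward - f(x, a))^2.
        step = self.lr * weight * (reward - self.predict(x, a))
        self.w[a] = [wi + step * xi for wi, xi in zip(self.w[a], x)]


def train(
    scorer: LinearScorer,
    weighted_data: List[WeightedRecord],
    next_context: Callable[[], Vector],
    get_reward: Callable[[Vector, int], float],
    num_actions: int,
    iterations: int,
    epsilon: float = 0.1,
    max_weight: float = 5.0,
    seed: int = 0,
) -> LinearScorer:
    rng = random.Random(seed)
    # Step 610: warm-start from the weighted logged records (guided exploration).
    for a, x, r, w in weighted_data:
        scorer.update(x, a, r, weight=min(w, max_weight))
    for _ in range(iterations):  # step 640: fixed iteration budget
        x = next_context()  # step 615: contextual feature vector
        # Step 620: epsilon-greedy policy over the current scorer.
        if rng.random() < epsilon:
            a = rng.randrange(num_actions)
        else:
            a = max(range(num_actions), key=lambda k: scorer.predict(x, k))
        r = get_reward(x, a)  # step 625: reward for the chosen action only
        # Steps 630/635: the policy is greedy in the scorer, so this single
        # update refreshes both the scoring function and the policy.
        scorer.update(x, a, r)
    return scorer  # step 645: trained scoring function
```

Because the policy here is derived directly from the scorer, steps 630 and 635 coincide; a policy with its own parameters (for example, a decaying exploration rate) would need a separate update.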

FIG. 7 depicts an exemplary system diagram of a training data generator, according to various embodiments of the present teaching. The training data generator includes an interaction log database 710, a training data generator 720, and a weight computing unit 730. The interaction log database 710 includes training data records associated with a user. The training data records can be used for unbiased estimation of expected rewards. Each training data record in the interaction log database 710 is represented as a four-tuple and corresponds to a slotting decision previously taken.

By one embodiment, the four-tuple is represented as (a, p, x, r), wherein the parameter a corresponds to an action taken, i.e., which content item (video, article, slideshow, etc.) was shown at a particular slot position to the user. The parameter p corresponds to a probability that the content item was slotted, the parameter x corresponds to a feature vector of the content item associated with a context (e.g., time, location, preference, etc.), and the parameter r corresponds to the score associated with the content item.

By one embodiment, the weight computing unit 730 extracts the parameter p for each training data record and computes a weight that is to be associated with the training data record. The weight computed by the weight computing unit 730 is inversely proportional to the parameter p, i.e., the probability. For instance, by one embodiment, the weight computing unit 730 computes the weight for each training data record as w = 1/p. The training data generator 720 associates the weight computed by the weight computing unit 730 with each training data record to generate weighted training data 740.
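The weighting step amounts to attaching w = 1/p to each logged four-tuple. A minimal sketch, assuming p is strictly positive and at most 1; the record and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class InteractionRecord:
    """One logged slotting decision, the four-tuple (a, p, x, r)."""
    a: int              # action: which content item/type was shown at the slot
    p: float            # probability with which it was slotted
    x: Sequence[float]  # feature vector associated with the context
    r: float            # observed score (reward)


@dataclass(frozen=True)
class WeightedRecord:
    record: InteractionRecord
    w: float            # weight, inversely proportional to p


def weight_records(records: List[InteractionRecord]) -> List[WeightedRecord]:
    """Attach w = 1/p to each record (weight computing unit plus generator, sketch)."""
    weighted = []
    for rec in records:
        if not 0.0 < rec.p <= 1.0:
            raise ValueError(f"invalid slotting probability: {rec.p}")
        weighted.append(WeightedRecord(rec, 1.0 / rec.p))
    return weighted
```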

It must be appreciated that the weight computed by the weight computing unit 730 is inversely proportional to the probability associated with the content item because, if a content item has a lower probability of being shown at a certain position, that content item should be treated as more important than a content item that is shown frequently. Accordingly, in training of the reward predictor, one can take the weight of each training data record into consideration to determine whether an exploration or exploitation action is to be taken. In this manner, the training engine as described previously with reference to FIG. 5 provides a guided exploration mechanism to understand the user's preference. Moreover, the training engine is less likely to be affected by noise from the existing system and setup. In practice, assigning higher weights to rare data records also alleviates the data imbalance problem.
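The unbiased estimation of expected rewards mentioned with reference to FIG. 7 is commonly done with inverse propensity scoring (IPS), in which each logged reward is multiplied by its weight w = 1/p. IPS is named here as a standard technique, not as the disclosed method. The sketch below assumes the slotting probabilities p are known and positive, drops the context x for brevity, and estimates, per action, the expected reward that action would earn if it were always chosen. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each logged record is reduced to (a, p, r): action shown, its slotting probability, reward.
Record = Tuple[int, float, float]


def ips_action_values(records: Iterable[Record]) -> Dict[int, float]:
    """IPS estimate V(a) = (1/n) * sum_i 1[a_i == a] * r_i / p_i for each logged action."""
    totals: Dict[int, float] = defaultdict(float)
    n = 0
    for a, p, r in records:
        n += 1
        totals[a] += r / p  # reward scaled by its weight w = 1/p
    return {a: total / n for a, total in totals.items()} if n else {}
```

For instance, nine records of action 0 logged with p = 0.9 and reward 0.5, plus one record of action 1 logged with p = 0.1 and reward 0.8, give estimates of 0.5 and 0.8: the single rare record is weighted by 10, against roughly 1.1 for each frequent one.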

FIG. 8 is a flowchart of an exemplary process performed by a training data generator, according to various embodiments of the present teaching. The process commences in step 810, wherein a plurality of interaction records (i.e., training data records) of a user are obtained. In step 820, a parameter is extracted from each training data record. Note that each training data record is represented as a four-tuple as described previously.

The process then moves to step 830, wherein a weighting factor is computed for each training data record. By one embodiment, the weighting factor is computed based on the parameter extracted in step 820. Further, in step 840, the computed weighting factor is associated with the training data record to generate weighted training data associated with the user.

Turning now to FIG. 10, there is depicted an architecture of a mobile device 1000, which can be used to realize a specialized system implementing the present teaching. In this example, a user device on which the functionalities of the various embodiments described herein can be implemented is a mobile device 1000, including, but not limited to, a smart phone, a tablet, a music player, a handheld gaming console, a global positioning system (GPS) receiver, and a wearable computing device (e.g., eyeglasses, wrist watch, etc.), or in any other form factor.

The mobile device 1000 in this example includes one or more central processing units (CPUs) 1040, one or more graphic processing units (GPUs) 1030, a display 1020, a memory 1060, a communication platform 1010, such as a wireless communication module, storage 1090, and one or more input/output (I/O) devices 1050. Any other suitable component, including but not limited to a system bus or a controller (not shown), may also be included in the mobile device 1000. As shown in FIG. 10, a mobile operating system 1070, e.g., iOS, Android, Windows Phone, etc., and one or more applications 1080 may be loaded into the memory 1060 from the storage 1090 in order to be executed by the CPU 1040. The applications 1080 may include a browser or any other suitable mobile apps for performing the various functionalities on the mobile device 1000. User interactions with the content displayed on the display panel 1020 may be achieved via the I/O devices 1050.

To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein. The hardware elements, operating systems and programming languages of such computers are conventional in nature, and it is presumed that those skilled in the art are adequately familiar therewith to adapt those technologies. A computer with user interface elements may be used to implement a personal computer (PC) or other type of work station or terminal device, although a computer may also act as a server if appropriately programmed. It is believed that those skilled in the art are familiar with the structure, programming, and general operation of such computer equipment and as a result the drawings should be self-explanatory.

FIG. 11 is an illustrative diagram of an exemplary computer system architecture, in accordance with various embodiments of the present teaching. Such a specialized system incorporating the present teaching has a functional block diagram illustration of a hardware platform which includes user interface elements. Computer 1100 may be a general-purpose computer or a special purpose computer. Both can be used to implement a specialized system for the present teaching. Computer 1100 may be used to implement any component(s) described herein. For example, the present teaching may be implemented on a computer such as computer 1100 via its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown, for convenience, the computer functions relating to the present teaching as described herein may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load.

Computer 1100, for example, may include communication ports 1150 connected to and from a network connected thereto to facilitate data communications. Computer 1100 also includes a central processing unit (CPU) 1120, in the form of one or more processors, for executing program instructions. The exemplary computer platform may also include an internal communication bus 1110, program storage and data storage of different forms (e.g., disk 1170, read only memory (ROM) 1130, or random access memory (RAM) 1140), for various data files to be processed and/or communicated by computer 1100, as well as possibly program instructions to be executed by CPU 1120. Computer 1100 may also include an I/O component 1160 supporting input/output flows between the computer and other components therein such as user interface elements 1180. Computer 1100 may also receive programming and data via network communications.

Hence, aspects of the present teaching(s), as outlined above, may be embodied in programming. Program aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Tangible non-transitory “storage” type media include any or all of the memory or other storage for the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide storage at any time for the software programming.

All or portions of the software may at times be communicated through a network such as the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of the content blending engine into the hardware platform(s) of a computing environment or other system implementing a computing environment or similar functionalities in connection with blending of content. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

Hence, a machine-readable medium may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, which may be used to implement the system or any of its components as shown in the drawings. Volatile storage media include dynamic memory, such as a main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that form a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a physical processor for execution.

Those skilled in the art will recognize that the present teachings are amenable to a variety of modifications and/or enhancements. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software-only solution, e.g., an installation on an existing server. In addition, the content blending engine, as disclosed herein, may be implemented as firmware, a firmware/software combination, a firmware/hardware combination, or a hardware/firmware/software combination.

While the foregoing has described what are considered to constitute the present teachings and/or other examples, it is understood that various modifications may be made thereto and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.

We claim:
1. A method, implemented on a machine having at least one processor, storage, and a communication platform capable of connecting to a network for training a model used for estimating a reward for a content item, the method comprising: obtaining, by a training engine, training data; and performing, by the training engine, training iterations each of which includes: selecting, from a plurality of actions, an action associated with the content item according to a policy, obtaining a reward associated with the selected action, updating the model based on the obtained reward, and updating the policy based on at least one of the selected action, the obtained reward, and the obtained training data.